Abstract

Supporting top-k document retrieval queries on general text databases, that is, finding the
k documents where a given pattern occurs most frequently, has become a topic of interest with
practical applications. While the problem has been solved in optimal time and linear space, the
actual space usage is a serious concern. In this paper we study various reduced-space structures
that support top-k retrieval and propose new alternatives. Our experimental results show that
our novel algorithms and data structures dominate almost all the space/time tradeoff.

1 Introduction



Ranked document retrieval is the basic task of most search engines. It consists in preprocessing
a collection of d documents, $\mathcal{D} = \{D_1, D_2, \ldots, D_d\}$, so that later, given a query pattern P and a
threshold k, one quickly finds the k documents where P is "most relevant".

The best known application scenario is that of documents being formed by natural language 
texts, that is, sequences of words, and the query patterns being words, phrases (sequences of words), 
or sets of words or phrases. Several relevance measures are used, which attempt to establish the 
significance of the query in a given document [3]. The term frequency, that is, the number of times 
the pattern appears in the document, is the main component of most measures. 

Ranked document retrieval is usually solved with some variant of a simple structure called an
inverted index [27, 3]. This structure, which is behind most search engines, handles natural
language collections well. However, the term "natural language" hides several assumptions that are key
to the efficiency of that solution: the text must be easily tokenized into a sequence of words, there 
must not be too many different words, and queries must be whole words or phrases. 

Those assumptions do not hold in various applications where document retrieval is of interest. 
The most obvious ones are documents written in Oriental languages such as Chinese or Korean, 
where it is not easy to split words automatically, and search engines treat the text as a sequence 
of symbols, so that queries can retrieve any substring of the text. Other applications simply do 
not have a concept of word, yet ranked retrieval would be of interest: DNA or protein sequence 
databases where one seeks the sequences where a short marker appears frequently, source code 
repositories where one looks for functions making heavy use of an expression or function call, MIDI 
sequence databases where one seeks for pieces where a given short passage is repeated, and so on. 

These problems are modeled as a text collection where the documents $D_i$ are strings over an
alphabet $\Sigma$, of size $\sigma$, and the queries are also simple strings. The most popular relevance measure
is the plain term frequency, that is, the number of occurrences of the string P in the strings $D_i$.$^1$
We call $n = \sum_i |D_i|$ the collection size and $m = |P|$ the pattern length.

Muthukrishnan [20] pioneered the research on document retrieval for general strings. He solved
the simpler problem of "document listing": report the occ distinct documents where P appears in
optimal time $O(m + occ)$ and linear space, $O(n)$ integers (or $O(n \log n)$ bits). Muthukrishnan also
considered various other document retrieval problems, but not top-k retrieval.

The first efficient solution for the top-k retrieval problem was introduced by Hon, Shah, and
Wu [15]. They achieved $O(m + \log n \log\log n + k)$ time, yet the space was superlinear, $O(n \log^2 n)$
bits. Soon, Hon, Shah, and Vitter [14] achieved $O(m + k \log k)$ time and linear space, $O(n \log n)$
bits. Recently, Navarro and Nekrich [21] achieved optimal time, $O(m + k)$, and reduced the space
from $O(n \log n)$ to $O(n(\log\sigma + \log d))$ bits (albeit the constant is not small).

While these solutions seem to close the problem, it turns out that the space required by
$O(n \log n)$-bit solutions is far too large for practical applications. A recent space-conscious
implementation of Hon et al.'s index [23] showed that it requires at least 5 times the text size.

Motivated by this challenge, there has been a parallel research track on how to reduce the space
of these solutions, while retaining efficient search time [14, 25, 7, 9, 4, 22, 13]. In this work we
introduce a new variant with relevant theoretical and practical properties, and show experimentally 
that it dominates previous work. The next section puts our contribution in context. 

1 It is usual to combine the term frequency with the so-called "inverse document frequency", but this makes a
difference only in the more complex bag-of-word queries, which have not yet been addressed in this context.

2 Related Work



Most of the data structures for general text searching, and in particular the classical ones for
document retrieval [20, 14], build on suffix arrays [18] and suffix trees [26, 1]. Regard the
collection $\mathcal{D}$ as a single text $T[1,n] = D_1 D_2 \cdots D_d$, where each $D_i$ is terminated by a special
symbol "\$". A suffix array $A[1,n]$ is a permutation of the values $[1,n]$ that points to all the
suffixes of T: A[i] points to the suffix T[A[i], n]. The suffixes are lexicographically sorted in A:
$T[A[i], n] < T[A[i+1], n]$ for all $1 \le i < n$. Since the occurrences of any pattern P in T correspond
to suffixes of T that are prefixed by P, the occurrences are pointed from a contiguous area in the
suffix array A[sp, ep]. A simple binary search finds sp and ep in $O(m \log n)$ time [18]. A suffix
tree is a digital tree with $O(n)$ nodes where all the suffixes of T are inserted and unary paths are
compacted. Every internal node of the suffix tree corresponds to a repeated substring of T and its
associated suffix array interval; suffix tree leaves correspond to the suffixes and their corresponding
suffix array cells. A top-down traversal in the suffix tree finds the internal node (called the locus of
P) from where all the suffixes prefixed with P descend, in $O(m)$ time. Once sp and ep are known,
the top-k query finds the k documents where most suffixes in A[sp, ep] start.
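
The definitions above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the function names (build_index, suffix_range, top_k_brute_force) are ours and not part of any cited work, the suffix array is built by naively sorting suffixes, and ranges are 0-based and half-open rather than the 1-based inclusive ranges used in the text. The top-k step simply counts documents over the whole range, which is the brute-force baseline that the structures discussed in this paper aim to avoid.

```python
from collections import Counter

def build_index(docs):
    """Concatenate the documents (each terminated by '$') and build a naive suffix array."""
    text = "".join(d + "$" for d in docs)
    # doc_of[p] = index of the document that text position p belongs to
    doc_of = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of

def suffix_range(text, sa, pattern):
    """Binary search for the half-open range [sp, ep) of suffixes prefixed by pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo

def top_k_brute_force(text, sa, doc_of, pattern, k):
    """Count, for each document, how many suffixes in the range start in it."""
    sp, ep = suffix_range(text, sa, pattern)
    counts = Counter(doc_of[sa[i]] for i in range(sp, ep))
    return counts.most_common(k)

text, sa, doc_of = build_index(["abracadabra", "cadabra", "banana"])
print(top_k_brute_force(text, sa, doc_of, "abra", 2))
```

Each comparison in the binary search inspects at most m symbols, which matches the $O(m \log n)$ bound quoted above; the counting step, in contrast, takes time proportional to the size of the range.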

A first step towards reducing the space in top-k solutions is to compress the suffix array.
Compressed suffix arrays (CSAs) simulate a suffix array within as little as $nH_k(T) + o(n \log\sigma)$ bits, for
any $k \le \alpha \log_\sigma n$ and any constant $0 < \alpha < 1$. Here $H_k(T)$ is the k-th order entropy of T [19], a
measure of its statistical compressibility. The CSA, using |CSA| bits, finds sp and ep in time search(m),
and computes any cell A[i], and even $A^{-1}[i]$, in time lookup(n). For example, a CSA achieving the
small space given above [8] achieves $search(m) = O(m(1 + \frac{\log\sigma}{\log\log n}))$ and $lookup(n) = O(\log^{1+\epsilon} n)$
for any constant $\epsilon > 0$. CSAs also replace the collection, as they can extract any substring of T.

In their very same foundational paper, Hon et al. [14] proposed an alternative succinct data
structure to solve the top-k problem. Building on a solution by Sadakane [24] for document listing,
they use a CSA for T and one smaller CSA for each document $D_i$, plus little extra data, for a
total space of $2|CSA| + o(n) + d \log(n/d) + O(d)$ bits. They achieve time $O(search(m) + k \log^{3+\epsilon} n \cdot lookup(n))$,
for any constant $\epsilon > 0$. Gagie, Navarro, and Puglisi [9] slightly reduced the time
to $O(search(m) + k \log d \log(d/k) \log^{1+\epsilon} n \cdot lookup(n))$, and Belazzougui and Navarro [4] further
improved it to $O(search(m) + k \log k \log(d/k) \log^{\epsilon} n \cdot lookup(n))$.

The essence of the succinct solution by Hon et al. [14] is to preprocess top-k answers for the
lowest suffix tree nodes containing any range $A[i \cdot g, j \cdot g]$ for some sampling parameter g. Given the
query interval A[sp, ep], they find the highest preprocessed suffix tree node whose interval [sp', ep']
is contained in [sp, ep]. They show that $sp' - sp < g$ and $ep - ep' < g$, and then the cost of
correcting the precomputed answer using the extra occurrences at A[sp, sp'-1] and A[ep'+1, ep] is
bounded. For each such extra occurrence A[i], one finds out its document, computes the number of
occurrences of P within that document, and lets the document compete in the top-k precomputed
list. Hon et al. use the individual CSAs and other data structures to carry out this task. The
subsequent improvements [9, 4] are due to small optimizations on this basic design.

Gagie et al. [9] also pointed out that in fact Hon et al.'s solution can run on any other data
structure able to (1) tell which is the document corresponding to a given A[i], and (2) count
how many times the same document appears in any interval A[sp, ep]. A structure that is
suitable for this task is the document array D[1,n], where D[i] is the document A[i] belongs to [20].
While in Hon et al.'s solution this is computed from A[i] using $d \log(n/d) + O(d)$ extra bits [24], we
need more machinery for task (2). A good alternative was proposed by Mäkinen and Välimäki [25]
in order to reduce the space of Muthukrishnan's document listing solution [20]. The structure is a
wavelet tree [12] on D. The wavelet tree represents D using $n \log d + o(n) \log d$ bits and not only
computes any D[i] in $O(\log d)$ time, but it can also compute operation $rank_i(D, j)$, which is the
number of occurrences of document i in D[1, j], within the same time. This solves operation (2) as
$rank_{D[i]}(D, ep) - rank_{D[i]}(D, sp-1)$. With the obvious disadvantage of the considerable extra space
to represent D, this solution replaces lookup(n) by $\log d$ in the query time. Gagie et al. show many
other combinations that solve (1) and (2). One of the fastest uses Golynski et al.'s representation
on D and, within the same space, changes lookup(n) to $\log\log d$ in the time. Very recently,
Hon, Shah, and Thankachan [13] presented new combinations in the line of Gagie et al., using also
faster CSAs. The least space-consuming one requires $n \log d + n\,o(\log d)$ bits of extra space on top
of the CSA of T, and improves the time to $O(search(m) + k(\log k + (\log\log n)^{2+\epsilon}))$.

Belazzougui and Navarro [4] used an approach based on minimum perfect hash functions to
replace the array D by a weaker data structure that takes $O(n \log\log\log d)$ bits of space and
supports the search in time $O(search(m) + k \log k \log^{1+\epsilon} n \cdot lookup(n))$. This solution is intermediate
between representing D or the individual CSAs and it could have practical relevance.

Culpepper, Navarro, Puglisi, and Turpin [7] built on an improved document listing algorithm on
wavelet trees [10] to achieve two top-k algorithms, called Quantile and Greedy, that use the wavelet
tree alone (i.e., without Hon et al.'s [14] extra structures). Despite their worst-case complexity
being as bad as extracting one by one the results in A[sp, ep], that is, $O((ep - sp + 1) \log d)$, in
practice the algorithms performed very well, with Greedy being superior. They implemented Sadakane's
solution [24] of using individual CSAs for the documents and showed that the overheads are very
high in practice. Navarro, Puglisi, and Valenzuela [22] arrived at the same conclusion, showing
that Hon et al.'s original succinct scheme is not promising in practice: both space and time were
much higher than in Culpepper et al.'s solution. However, their preliminary experiments
[22] showed that Hon et al.'s scheme could compete when running on wavelet trees.

Navarro et al. [22] also presented the first implemented alternative to reduce the space of wavelet
trees, by using Re-Pair compression [17] on the bitmaps. They showed that significant reductions 
in space were possible in exchange for an increase in the response time of Culpepper et al.'s Greedy 
algorithm (half the space and twice the time is a common figure). 

This review exposes interesting contrasts between the theory and the practice in this area. On
one hand, the structures that are in theory larger and faster (i.e., the $n \log d$-bit wavelet tree
versus a second CSA of at most $n \log\sigma$ bits) are in practice smaller and faster. On the other hand,
algorithms with no worst-case bound (Culpepper et al.'s [7]) perform very well in practice. Yet, the
space of wavelet trees is still considerably large in practice (about twice the plain size of T in several
test collections [22]), especially if we realize that they represent totally redundant information that
could be extracted from the CSA of T.

In this paper we study a new practical alternative. We use Hon et al.'s [14] succinct structure
on top of a wavelet tree, but instead of brute force we use a variant of Culpepper et al.'s [7]
method to find the extra candidate documents in A[sp, sp'-1] and A[ep'+1, ep]. We can regard the
combination as either method boosting the other. Culpepper et al. boost Hon et al.'s method, while
retaining its good worst-case complexities, as they find the extra occurrences more cleverly than
by enumerating them all. Hon et al. boost plain Culpepper et al.'s method by having precomputed
a large part of the range, and thus ensuring that only small intervals have to be handled.

We consider the plain and the compressed wavelet tree representations, and the straightforward
and novel representations of Hon et al.'s succinct structure. We compare these alternatives with the
original Culpepper et al.'s method (on plain and compressed wavelet trees), to test the hypothesis
that adding Hon et al.'s structure is worth the extra space. Similarly, we include in the comparison
the basic Hon et al.'s method (with their structure compressed or not) over Golynski et al.'s
sequence representation, to test the hypothesis that using Culpepper et al.'s method over the
wavelet tree is worthwhile compared to the brute force method over the fastest sequence representation.
This brute force method is also at the core of the new proposal by Hon et al. [13].

Our experiments show that our new algorithms and data structures dominate almost all the
space/time tradeoff for this problem, becoming a new practical reference point.

3 Implementing Hon et al.'s Succinct Structure 

The succinct structure of Hon et al. [14] is a sparse generalized suffix tree of T (SGST; "generalized"
means it indexes d strings). It is obtained by cutting A[1,n] into blocks of length g and sampling
the first and last cell of each block (recall that cells of A are leaves of the suffix tree). Then all the
lowest common ancestors (lca) of pairs of sampled leaves are marked, and a tree $\tau_k$ is formed with
those (at most) 2n/g marked internal nodes. The top-k answer is stored for each marked node,
using $O((n/g) k \log n)$ bits. This is done for k = 1, 2, 4, ..., and parameter g is of the form $g = k \cdot g'$.
The final space is $O((n/g') \log d \log n)$ bits. This is made o(n) by properly choosing g'.

To answer top-k queries, they search the CSA for P, to obtain the suffix range A[sp, ep] of the
pattern. Then they turn to the closest higher power of two of k, $k^* = 2^{\lceil \log k \rceil}$, and let $g = k^* \cdot g'$
be the corresponding g value. They now find the locus of P in the tree $\tau_{k^*}$ by descending from
the root until finding the first node v whose interval $[sp_v, ep_v]$ is contained in [sp, ep]. They have
at v the top-k candidates for $[sp_v, ep_v]$ and have to correct the answer considering $[sp, sp_v - 1]$ and
$[ep_v + 1, ep]$. Now we introduce two implementations of this idea.
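
As a small worked illustration of the query setup just described, the following Python sketch computes the power of two used to select the tree, the corresponding block length, and the two ranges left uncovered by the locus. The function names and the numeric values in the usage lines are illustrative assumptions, not part of the original design.

```python
def query_parameters(k, g_prime):
    """Return (k_star, g): the smallest power of two >= k and the block length g = k_star * g_prime."""
    k_star = 1 << (k - 1).bit_length()
    return k_star, k_star * g_prime

def uncovered_ranges(sp, ep, sp_v, ep_v):
    """Inclusive ranges of [sp, ep] not covered by the locus interval [sp_v, ep_v].
    A range (a, b) with a > b is empty."""
    return (sp, sp_v - 1), (ep_v + 1, ep)

print(query_parameters(10, 400))        # k = 10 selects the tree built for k* = 16
print(uncovered_ranges(4, 14, 7, 11))   # the two ranges that still need correction
```

Only the documents appearing in the two uncovered ranges can change the precomputed answer, which is why their total length (less than 2g) bounds the correction work.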

3.1 Sparsified Generalized Suffix Tree (SGST) 

Let us call $l_i = A[i]$ the i-th leaf. Given a value of k we define $g = k \cdot g'$, for a space/time tradeoff
parameter g', and sample n/g leaves $l_1, l_{g+1}, l_{2g+1}, \ldots$, instead of sampling 2n/g leaves as in the
theoretical proposal. We mark internal SGST nodes $lca(l_1, l_{g+1}), lca(l_{g+1}, l_{2g+1}), \ldots$. It is not hard
to prove that any $v = lca(l_{ig+1}, l_{jg+1})$ is also $v = lca(l_{rg+1}, l_{(r+1)g+1})$ for some r (more precisely,
$l_{rg+1}$ is the rightmost sampled leaf descending from the child of v that is an ancestor of $l_{ig+1}$). Therefore
these n/g SGST nodes are sufficient and can be computed in linear time [5].

Now we note that there is a great deal of redundancy in the log d trees $\tau_k$, since the nodes of $\tau_{2k}$
are included in those of $\tau_k$, and the 2k candidates stored in the nodes of $\tau_{2k}$ contain those in the
corresponding nodes of $\tau_k$. To factor out some of this redundancy we store only one tree $\tau$, whose
nodes are the same as those of $\tau_1$, and record the class c(v) of each node $v \in \tau$. This is $c(v) = \max\{k, v \in \tau_k\}$
and can be stored in log log d bits. Each node $v \in \tau$ stores the top-c(v) candidates corresponding
to its interval, using c(v) log d bits, and their frequencies, using c(v) log n bits, plus a pointer to
the table,$^2$ and the interval itself, $[sp_v, ep_v]$, using 2 log n bits. All the information on intervals and
candidates is factored in this way, saving space. Note that the class does not necessarily decrease
monotonically in a root-to-leaf path of $\tau$, thus we store all the topologies independently to allow
for efficient traversal of the $\tau_k$ trees, for k > 1. Apart from topology information, each node of such
$\tau_k$ trees contains just a pointer to the corresponding node in $\tau$, using $\log|\tau|$ bits.

2 Actually, an index to a big table where all these small tables are stored consecutively.

In our first data structure, the topology of the trees $\tau$ and $\tau_k$ is represented using pointers of
$\log|\tau|$ and $\log|\tau_k|$ bits, respectively. To answer top-k queries, we find the range A[sp, ep] using
a CSA (whose space and negligible time will not be reported because it is orthogonal to all the
data structures). Now we find the locus in the appropriate tree $\tau_{k^*}$ top-down, binary searching
the intervals $[sp_v, ep_v]$ of the children of the current node, and extracting those intervals using the
pointer to $\tau$. By the properties of the sampling [14] it follows that we will traverse in this descent
nodes $v \in \tau_{k^*}$ such that $[sp, ep] \subseteq [sp_v, ep_v]$, until reaching a node v so that
$[sp_v, ep_v] = [sp', ep'] \subseteq [sp, ep] \subseteq [sp' - g, ep' + g]$ (or reaching a leaf $u \in \tau_{k^*}$ such that
$[sp, ep] \subseteq [sp_u, ep_u]$, in which case $ep - sp + 1 < 2g$). This v is the locus of P in $\tau_{k^*}$, and we find it
in time $O(m \log\sigma)$. This time is negligible compared to the subsequent costs, as is the search using the CSA.

3.2 Succinct SGST 

Our second implementation uses a pointerless representation of the tree topologies. Although the 
tree operations are slightly slower than on a pointer-based representation, this slowdown occurs on 
a not too significant part of the search process, and a succinct representation allows one to reduce 
the sampling parameter g for the same space usage. 

Arroyuelo et al. [2] showed that, for the functionality it provides, the most promising succinct 
representation of trees is the so-called Level-Order Unary Degree Sequence (LOUDS) [16]. It
requires 2N + o(N) bits of space (in practice, as little as 2.1N) to represent a tree of N nodes, and
it solves many operations in constant time (less than a microsecond in practice). 

We use that implementation [2]. The shape of the tree is stored using a single binary sequence,
as follows. Starting with an empty bitstring, every node is visited in level order starting from the
root. Each node with c children is encoded by writing its arity in unary, that is, $1^c 0$ is appended to
the bitstring. Each node is identified with the position in the bitstring where the encoding of the
node begins. We store the values $sp_v$ and $ep_v$ in a separate array, indexed by the position of the
node v in the bitstring. Other node data such as pointers to $\tau$ (in $\tau_k$) and to the candidates (in $\tau$)
are stored in the same way. The space can be further reduced by storing only the identifiers of the
candidates, and their frequencies are computed on the fly using rank on the wavelet tree of D.
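
The level-order encoding described above can be sketched in a few lines of Python. This is a hedged illustration only: it follows the textual description (a node with c children contributes c ones followed by a zero), the function name and the dictionary-based input tree are our own assumptions, and the rank/select machinery that makes navigation constant-time in the cited implementation is omitted.

```python
from collections import deque

def louds_encode(children, root):
    """Return (bits, position): the level-order unary degree sequence of the tree and,
    for each node, the position in the bitstring where its encoding begins."""
    bits, position = [], {}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        position[v] = len(bits)            # a node is identified by this position
        kids = children.get(v, [])
        bits.extend([1] * len(kids) + [0]) # arity in unary: c ones, then a zero
        queue.extend(kids)
    return bits, position

# Hypothetical 4-node tree: r has children a and b; a has child c.
bits, position = louds_encode({"r": ["a", "b"], "a": ["c"]}, "r")
print("".join(map(str, bits)), position)
```

A tree of N nodes yields exactly 2N - 1 bits under this description (N - 1 ones, one per edge, and N zeros), consistent with the 2N + o(N) figure once the auxiliary rank/select directories are added. Per-node data such as $sp_v$ and $ep_v$ would live in side arrays addressed through these positions.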

4 A New Top-k Algorithm

We run a combination of the algorithm by Hon et al. [14] and those of Culpepper et al. [7], over
a wavelet tree representation of the document array D[1,n]. Culpepper et al. introduce, among
others, a document listing method (DFS) and a Greedy top-k heuristic. We adapt these to our
particular top-k subproblem.

If the search for the locus of P ends at a leaf u that still contains the interval [sp, ep], Hon et
al. simply scan A[sp, ep] by brute force and accumulate frequencies. We use instead Culpepper et
al.'s Greedy algorithm, which is always better than a brute-force scan.

When, instead, the locus of P is a node v where $[sp_v, ep_v] = [sp', ep'] \subseteq [sp, ep]$, we start with
the precomputed answer of the $k \le k^*$ most frequent documents in [sp', ep'], and update it to
consider the subintervals [sp, sp'-1] and [ep'+1, ep]. We use the wavelet tree of D to solve the
following problem: Given an interval D[l, r], and two subintervals $[l_1, r_1]$ and $[l_2, r_2]$, enumerate
all the distinct values in $[l_1, r_1] \cup [l_2, r_2]$ together with their frequencies in [l, r]. We propose two
solutions, which can be seen as generalizations of heuristics proposed by Culpepper et al. [7].



[Figure 1: wavelet tree diagram for an example document array; the drawing is not recoverable from the extracted text.]

Figure 1: Restricted DFS to obtain the frequencies of documents not covered by $\tau_k$. Shaded regions
show the interval [sp, ep] = [4, 14] mapped to each wavelet tree node. Dark shaded intervals are
the projections of the leaves not covered by [sp', ep'] = [7, 11].



4.1 Restricted Depth-First Search (DFS) 

Figure 1 illustrates a wavelet tree representation of an array D (ignore colors for now). At the
root, a bitmap B[1,n] stores B[i] = 0 if $D[i] \le d/2$ and B[i] = 1 otherwise. The left child of the
root is, recursively, a wavelet tree handling the subsequence of D with values $D[i] \le d/2$, and the
right child handles the subsequence of values $D[i] > d/2$. Only the bitmaps B are actually stored.
Added over the log d levels, the wavelet tree requires n log d bits of space. With o(n log d) additional
bits we answer in constant time any query $rank_{0/1}(B, i)$ over any bitmap B [16].

Note that any interval D[i, j] can be projected into the left child of the root as
$[i_0, j_0] = [rank_0(B, i-1)+1, rank_0(B, j)]$, and into its right child as $[i_1, j_1] = [rank_1(B, i-1)+1, rank_1(B, j)]$,
where B is the root bitmap. Those can then be projected recursively into other wavelet tree nodes.

Our restricted DFS algorithm begins at the root of the wavelet tree and tracks down the intervals
$[l, r] = [sp, ep]$, $[l_1, r_1] = [sp, sp'-1]$, and $[l_2, r_2] = [ep'+1, ep]$. More precisely, we count the number
of zeros and ones in B in ranges $[l_1, r_1] \cup [l_2, r_2]$, as well as in [l, r], using a constant number of
rank operations on B. If there are any zeros in $[l_1, r_1] \cup [l_2, r_2]$, we map all the intervals into the
left child of the node and proceed recursively from this node. Similarly, if there are any ones in
$[l_1, r_1] \cup [l_2, r_2]$, we continue on the right child of the node. When we reach a wavelet tree leaf we
report the corresponding document, and the frequency is the length of the interval [l, r] at the leaf.
Figure 1 shows an example where we arrive at the leaves of documents 1, 2, 5 and 7, reporting
frequencies 2, 2, 1 and 4, respectively.

When solving the problem in the context of top-k retrieval, we can prune some recursive calls.
If, at some node, the size of the local interval [l, r] is smaller than the frequency of our current kth
candidate to the answer, we stop exploring its subtree since it cannot contain competitive documents.
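
The restricted traversal can be sketched as follows in Python. This is a minimal illustration under explicit assumptions: the wavelet tree is pointer-based and stores prefix counts of zeros instead of compact bitmaps with rank directories, ranges are 0-based and half-open, the names WaveletTree and restricted_dfs are ours, and the small document array is invented for the example (it is not the array of Figure 1).

```python
class WaveletTree:
    """Pointer-based wavelet tree over integers in [lo, hi].
    zeros[i] is the number of 0 bits among the first i positions of the node bitmap,
    so rank_0(B, i) = zeros[i] and rank_1(B, i) = i - zeros[i]."""
    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

def restricted_dfs(node, l, r, l1, r1, l2, r2, out, threshold=0):
    """Append (doc, frequency in [l, r)) for each distinct doc in [l1, r1) or [l2, r2).
    Subtrees whose interval [l, r) is shorter than threshold are pruned."""
    if (l1 >= r1 and l2 >= r2) or r - l < threshold:
        return
    if node.lo == node.hi:                 # leaf: one document, frequency = interval length
        out.append((node.lo, r - l))
        return
    z = node.zeros
    restricted_dfs(node.left, z[l], z[r], z[l1], z[r1], z[l2], z[r2], out, threshold)
    restricted_dfs(node.right, l - z[l], r - z[r], l1 - z[l1], r1 - z[r1],
                   l2 - z[l2], r2 - z[r2], out, threshold)

D = [2, 1, 4, 1, 3, 1, 2, 4, 1, 3, 2, 1]      # invented document array, d = 4
wt = WaveletTree(D, 1, 4)
sp, ep, sp_v, ep_v = 2, 10, 4, 8              # 1-based inclusive, as in the text
out = []
restricted_dfs(wt, sp - 1, ep, sp - 1, sp_v - 1, ep_v, ep, out)
print(out)
```

Document 2 occurs inside [sp, ep] but only within the precomputed part [sp', ep'], so the traversal never reaches its leaf; that saving over enumerating every value in [sp, ep] is the point of restricting the search to the two extra subintervals.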

4.2 Restricted Greedy 

Following the idea described by Culpepper et al., we can not only stop the traversal when [l, r] is
too small, but also prioritize the traversal of the nodes by the size of their [l, r] interval.

We keep a priority queue where we store the wavelet tree nodes yet to process, and their intervals
[l, r], $[l_1, r_1]$, and $[l_2, r_2]$. The priority queue begins with one element, the root. Iteratively, we
remove the element with highest $r - l + 1$ value from the queue. If it is a leaf, we report it. Else,
we project the intervals into its left and right children, and insert each child containing
nonempty intervals $[l_1, r_1]$ or $[l_2, r_2]$ into the queue. As soon as the $r - l + 1$ value of the element we
extract from the queue is not larger than the kth frequency known at the moment, we can stop.
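
A hedged Python sketch of this prioritized traversal follows. It makes several simplifying assumptions: the wavelet tree nodes are plain dictionaries with prefix counts of zeros (not compact bitmaps), ranges are 0-based and half-open, the kth frequency is passed as a fixed parameter kth_freq rather than being updated as candidates are found, and all names and the example array are ours.

```python
import heapq
from itertools import count

def build_wt(seq, lo, hi):
    """Wavelet tree node as a dict; 'zeros' holds prefix counts of values <= mid."""
    node = {"lo": lo, "hi": hi}
    if lo < hi and seq:
        mid = (lo + hi) // 2
        zeros = [0]
        for x in seq:
            zeros.append(zeros[-1] + (x <= mid))
        node["zeros"] = zeros
        node["left"] = build_wt([x for x in seq if x <= mid], lo, mid)
        node["right"] = build_wt([x for x in seq if x > mid], mid + 1, hi)
    return node

def restricted_greedy(root, l, r, l1, r1, l2, r2, kth_freq=0):
    """Yield (doc, frequency in [l, r)) for docs in [l1, r1) or [l2, r2), largest interval first.
    Stops once the largest pending interval is not larger than kth_freq."""
    tie = count()                          # tie-breaker so heapq never compares node dicts
    queue = [(-(r - l), next(tie), root, (l, r, l1, r1, l2, r2))]
    while queue:
        neg_size, _, node, iv = heapq.heappop(queue)
        if -neg_size <= kth_freq:
            return
        if node["lo"] == node["hi"]:       # leaf: report the document and its frequency
            yield node["lo"], iv[1] - iv[0]
            continue
        z = node["zeros"]
        projections = [
            (node["left"], tuple(z[i] for i in iv)),
            (node["right"], tuple(i - z[i] for i in iv)),
        ]
        for child, civ in projections:
            if civ[2] < civ[3] or civ[4] < civ[5]:   # some extra interval is nonempty
                heapq.heappush(queue, (-(civ[1] - civ[0]), next(tie), child, civ))

D = [2, 1, 4, 1, 3, 1, 2, 4, 1, 3, 2, 1]   # invented document array, d = 4
root = build_wt(D, 1, 4)
print(list(restricted_greedy(root, 1, 10, 1, 3, 8, 10)))
print(list(restricted_greedy(root, 1, 10, 1, 3, 8, 10, kth_freq=2)))
```

Because a child's interval is never longer than its parent's, leaves leave the queue in non-increasing order of frequency, so the first time the extracted size drops to the kth known frequency no later element can improve the answer.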

4.3 Heaps for the k Most Frequent Candidates 

Our two algorithms solve the query assuming that we can easily know at each moment which is 
the kth best candidate known up to now. We use a min-heap data structure for this purpose. It 
is loaded with the top-k precomputed candidates corresponding to the interval [sp',ep']. At each 
point, the top of the heap gives the kth known frequency in constant time. Given that the previous 
algorithms stop when they reach a wavelet tree node where r—l+1 is not larger than the kth known 
frequency, it follows that each time the algorithms report a new candidate, this is more frequent 
than our kth known candidate. Thus we replace the top of our heap with the reported candidate 
and reorder the heap (which is always of size k, or less until we find k distinct elements in D[sp, ep]).
Therefore each candidate reported costs $O(\log d + \log k)$ time (there are also steps that do not yield
any result, but the overall upper bound is still $O(g(\log d + \log k))$).

A remaining issue is that we can find again, in our DFS or Greedy traversal, a node that was
in the original top-k list, and thus possibly in the heap. This means that the document had been
inserted with its frequency in D[sp',ep'], but since it appears more times in D[sp,ep], we must 
now update its frequency, that is, increase it and restore the min-heap invariant. It is not hard to 
maintain a hash table with forward and backward pointers to the heap so that we can track their 
current positions and replace their values. However, for the small k values used in practice (say, 
tens or at most hundreds), it is more practical to scan the heap for each new candidate to insert 
than to maintain all those pointers upon all operations. 
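
The scan-based maintenance of the candidate heap can be sketched as below. This is an illustrative Python fragment rather than the implementation evaluated in this paper: the function name offer, the (frequency, document) tuple layout, and the sample values are assumptions, and the heap is rebuilt with heapify after an in-place update, which is adequate for the small k values mentioned above.

```python
import heapq

def offer(heap, k, doc, freq):
    """Insert or update candidate doc with frequency freq in a min-heap of at most k
    (frequency, document) pairs; heap[0] is always the kth best known candidate."""
    for i, (f, d) in enumerate(heap):      # linear scan instead of a hash table with pointers
        if d == doc:
            if freq > f:
                heap[i] = (freq, doc)      # the document was already a candidate: raise it
                heapq.heapify(heap)        # restore the min-heap invariant
            return
    if len(heap) < k:
        heapq.heappush(heap, (freq, doc))
    elif freq > heap[0][0]:
        heapq.heapreplace(heap, (freq, doc))   # evict the current kth candidate

# Precomputed top-2 for [sp', ep'] (hypothetical values): doc 10 with 3, doc 20 with 5.
heap = [(3, 10), (5, 20)]
heapq.heapify(heap)
offer(heap, 2, 10, 6)   # doc 10 reappears with a larger frequency in [sp, ep]
offer(heap, 2, 30, 4)   # not better than the kth candidate: ignored
offer(heap, 2, 30, 7)   # replaces the current minimum
print(sorted(heap))
```

The threshold used by the DFS and Greedy traversals is simply heap[0][0] once the heap holds k entries.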

5 Experimental Results 

We test the performance of our implementations of Hon et al.'s succinct structure combined with
a wavelet tree (as explained, the original proposal is not competitive in practice [22]).

We used three test collections of different nature: ClueWiki is a 141 MB sample of ClueWeb09,
formed by 3,334 Web pages from the English Wikipedia; KGS is a 75 MB collection of 18,838
sgf-formatted Go game records (http://www.u-go.net/gamerecords); and Proteins is a 60 MB
collection of 143,244 sequences of Human and Mouse proteins (http://www.ebi.ac.uk/swissprot).

Our tests were run on a 4-core 8-processor Intel Xeon, 2 GHz each, with 16 GB RAM and 2 MB
cache. We compiled using g++ with full optimization. For queries, we selected 1,000 substrings at 
random positions, of length 3 and 8, and retrieved the top-k documents for each, for k = 1 and 10. 

5.1 Choosing Our Best Variant 

Our first round of experiments compares our different implementations of SGSTs (i.e., the trees $\tau_k$,
see Section 3) over a single implementation of wavelet tree (Alpha, choosing the best value for $\alpha$ in
each case [22]). We tested a pointer-based representation of the SGST (Ptrs, the original proposal
[14]), a LOUDS-based representation (LOUDS), our variant of LOUDS that stores the topologies in
a unique tree $\tau$ (LIGHT), and our variant of LIGHT that does not store frequencies of the top-k
candidates (XLIGHT). We consider sampling steps of 200 and 400 for g'. For each value of g', we
obtain a curve with various sampling steps for the rank computations on the wavelet tree bitmaps.

We also tested different algorithms to find the top-k among the precomputed candidates and
remaining leaves (see Section 4): our modified greedy (Greedy), our modified depth-first search
(DFS), and the brute-force selection procedure of the original proposal [14] on top of the same wavelet
tree (Select). As this is orthogonal to the data structures used, we compare these algorithms only
on top of the Ptrs structure. The other structures will use the best method.

Figure 2 shows the results. Method Greedy is always better than Select (up to 80% better)
and DFS (up to 50%), which confirms intuition. Using the LOUDS representation instead of Ptrs had
almost no impact on the time. This is because the time needed to find the locus is usually negligible
compared with that to explore the uncovered leaves. Further costless space gains are obtained with
variant LIGHT. Variant XLIGHT, instead, reduces the space of LIGHT at a noticeable cost in time that
makes it not so interesting, except on Proteins. In various cases the sparser sampling dominates
the denser one, whereas in others the latter makes the structure faster if sufficient space is spent.

To compare with other techniques, we will use variant LIGHT on ClueWiki and KGS, and
XLIGHT on Proteins, both with g' = 400. This combination will be called generically SSGST.



5.2 Comparison with Previous Work 

The second round of experiments compares ours with previous work. The Greedy heuristic [7] is
run over different wavelet-tree representations of the document array: a plain one (WT-Plain) [7],
a Re-Pair compressed one (WT-RP), and a hybrid that at each wavelet tree level chooses between
plain, Re-Pair, or entropy-based compression of the bitmaps (WT-Alpha) [22]. We combine these
with our best implementation of Hon et al.'s structure (suffixing the previous names with +SSGST).
We also consider variant Goly+SSGST [9, 13], which runs the rank-based method (Select) on top
of the fastest rank-capable sequence representation of the document array (Golynski et al.'s [11],
which is faster than wavelet trees for rank but does not support our more sophisticated algorithms;
here we used the implementation at http://libcds.recoded.cl).



Our new structures dominate most of the space-time map. When using little space, variant
WT-RP+SSGST dominates, being only occasionally and slightly superseded by WT-RP. When using
more space, WT-Alpha+SSGST takes over, and finally, with even more space, WT-Plain+SSGST
becomes the best choice. Most of the exceptions arise in Proteins, which due to its incompressibility
[22] makes WT-Plain+SSGST essentially the only interesting variant. The alternative Goly+SSGST is
in no case faster than a Greedy algorithm over plain wavelet trees (WT-Plain), and takes more space.



6 Future Work 

We can further reduce the space in exchange for possibly higher times. For example, the sequence
of all precomputed top-k candidates can be Huffman-compressed, as there is much repetition in
the sets and a zero-order compression would yield space reductions of up to 25% in the case of
Proteins, the least compressible collection. The pointers to those tables could also be removed, by
separating the tables by size, and computing the offset within each size using rank on the sequence
of classes of the nodes in $\tau$. Finally, values $[sp_v, ep_v]$ can be stored as $[sp_v, ep_v - sp_v]$, using DACs
for the second components [6], as many such differences will be small.

[Figure 2 plots, not recoverable from the extracted text. Panels: ClueWiki, KGS, and Proteins. x-axis: Size (bpc). Series: Ptrs 200 and Ptrs 400, each with Greedy, DFS, and Select; LOUDS 200 and 400; LIGHT 200 and 400; XLIGHT 200 and 400.]

Figure 2: Our different alternatives for top-k queries. On the left for k = 1 and pattern length
m = 3; on the right for k = 10 and m = 8.

[Figure 3 plots, not recoverable from the extracted text. Panels: ClueWiki, KGS, and Proteins. x-axis: Size (bpc). Series: WT-RP, WT-Alpha, and WT-Plain, each alone and combined with SSGST, plus Goly+SSGST, for K = 1 and K = 10.]
Figure 3: Comparison with previous work. On the left, for m = 3, and on the right, for m